06. (Optional) Derivation

Behavior of different optimizers for stochastic gradient descent. ([Source](http://ruder.io/optimizing-gradient-descent/))

If you'd like to learn how to derive the equation that we use to approximate the gradient, please read the text below. Specifically, you'll learn how to derive

\nabla_\theta U(\theta) \approx \hat{g}= \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)})

This derivation is optional and can be safely skipped.

## Likelihood Ratio Policy Gradient

We'll begin by exploring how to calculate the gradient \nabla_\theta U(\theta). The calculation proceeds as follows:

\begin{aligned}\nabla_\theta U(\theta) &= \nabla_\theta \sum_\tau P(\tau;\theta)R(\tau) & (1)\\ &= \sum_\tau \nabla_\theta P(\tau;\theta)R(\tau) & (2)\\ &= \sum_\tau \frac{P(\tau;\theta)}{P(\tau;\theta)} \nabla_\theta P(\tau;\theta)R(\tau) & (3)\\ &= \sum_\tau P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}R(\tau) & (4)\\ &= \sum_\tau P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) R(\tau) & (5) \end{aligned}

First, note that line (1) follows directly from U(\theta) = \sum_\tau P(\tau;\theta)R(\tau); we've simply taken the gradient of both sides.

Then, line (2) follows because the gradient of a sum can be rewritten as the sum of the gradients.

In line (3), we simply multiply every term in the sum by \frac{P(\tau;\theta)}{P(\tau;\theta)}, which is allowed because this fraction is equal to one!

Next, line (4) is just a simple rearrangement of the terms from the previous line. That is, \frac{P(\tau;\theta)}{P(\tau;\theta)} \nabla_\theta P(\tau;\theta) = P(\tau;\theta) \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}.

Finally, line (5) follows from the chain rule, and the fact that the gradient of the log of a function is always equal to the gradient of the function, divided by the function. (In case it helps to see this with simpler notation, recall that \nabla_x \log f(x) = \frac{\nabla_x f(x)}{f(x)}.) Thus, \nabla_\theta \log P(\tau;\theta) = \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}.

The final "trick" that yields line (5) (i.e., \nabla_\theta \log P(\tau;\theta) = \frac{\nabla_\theta P(\tau;\theta)}{P(\tau;\theta)}) is referred to as the likelihood ratio trick or REINFORCE trick.

Accordingly, the gradient written in this form is commonly referred to as the likelihood ratio policy gradient:

\nabla_\theta U(\theta) = \sum_\tau P(\tau;\theta) \nabla_\theta \log P(\tau;\theta) R(\tau)

Once we’ve written the gradient as an expected value in this way, it becomes much easier to estimate.
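
To make the expected value explicit, note that the sum weights each trajectory \tau by its probability P(\tau;\theta), so the gradient can equivalently be written as

\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim P(\tau;\theta)}\big[ \nabla_\theta \log P(\tau;\theta)\, R(\tau) \big]

and expectations can be approximated by averaging over samples.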

## Sample-Based Estimate

In the video on the previous page, you learned that we can approximate the likelihood ratio policy gradient with a sample-based average, as shown below:

\nabla_\theta U(\theta) \approx \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta)R(\tau^{(i)})

where each \tau^{(i)} is a sampled trajectory.
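
To build intuition for this estimator, here is a toy sketch in NumPy. It replaces the trajectory distribution with a one-dimensional Gaussian \mathcal{N}(\theta, 1) and uses a made-up reward R(x) = -(x-3)^2, so none of the specifics come from the lesson; the point is only that averaging \nabla_\theta \log p \cdot R over samples approximates the gradient of the expected reward:

```python
# Toy score-function (likelihood ratio) estimate with a Gaussian "policy".
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
m = 10_000  # number of samples playing the role of trajectories

x = rng.normal(loc=theta, scale=1.0, size=m)  # x^{(i)} ~ N(theta, 1)
R = -(x - 3.0) ** 2                           # made-up reward R(x^{(i)})

# For N(theta, 1): grad_theta log p(x; theta) = (x - theta).
grad_log_p = x - theta

g_hat = np.mean(grad_log_p * R)  # (1/m) * sum_i grad log p * R
print(g_hat)  # close to 6, the true gradient of E[R] at theta = 0
```

In this toy setup the expected reward is \mathbb{E}[R] = -(\theta-3)^2 - 1, whose gradient at \theta=0 is 6, so the sample average lands near that value.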

## Finishing the Calculation

Before calculating the expression above, we will need to further simplify \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta). The derivation proceeds as follows:

\begin{aligned} \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta) &= \nabla_\theta \log \Bigg[ \prod_{t=0}^{H} \mathbb{P}(s_{t+1}^{(i)}|s_{t}^{(i)}, a_t^{(i)} )\pi_\theta(a_t^{(i)}|s_t^{(i)}) \Bigg] & (1)\\ &= \nabla_\theta \Bigg[ \sum_{t=0}^{H} \log \mathbb{P}(s_{t+1}^{(i)}|s_{t}^{(i)}, a_t^{(i)} ) + \sum_{t=0}^{H}\log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \Bigg] & (2)\\ &= \nabla_\theta\sum_{t=0}^{H} \log \mathbb{P}(s_{t+1}^{(i)}|s_{t}^{(i)}, a_t^{(i)} ) + \nabla_\theta \sum_{t=0}^{H}\log \pi_\theta(a_t^{(i)}|s_t^{(i)}) & (3)\\ &= \nabla_\theta \sum_{t=0}^{H}\log \pi_\theta(a_t^{(i)}|s_t^{(i)}) & (4)\\ &= \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) & (5) \end{aligned}

First, line (1) shows how to calculate the probability of an arbitrary trajectory \tau^{(i)}. Namely, \mathbb{P}(\tau^{(i)};\theta) = \prod_{t=0}^{H} \mathbb{P}(s_{t+1}^{(i)}|s_{t}^{(i)}, a_t^{(i)} )\pi_\theta(a_t^{(i)}|s_t^{(i)}) , where we have to take into account the action-selection probabilities from the policy and the state transition dynamics of the MDP.

Then, line (2) follows from the fact that the log of a product is equal to the sum of the logs.

Then, line (3) follows because the gradient of the sum can be written as the sum of gradients.

Next, line (4) holds, because \sum_{t=0}^{H} \log \mathbb{P}(s_{t+1}^{(i)}|s_{t}^{(i)}, a_t^{(i)} ) has no dependence on \theta, so \nabla_\theta\sum_{t=0}^{H} \log \mathbb{P}(s_{t+1}^{(i)}|s_{t}^{(i)}, a_t^{(i)} )=0.

Finally, line (5) holds, because we can rewrite the gradient of the sum as the sum of gradients.
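
The practical consequence of lines (4) and (5) is that we never need the transition model to compute \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta); summing \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) over the trajectory and differentiating is enough. The short PyTorch sketch below illustrates this with a small, arbitrary policy network (4-dimensional states, 2 actions, and random placeholder data, none of which are tied to a particular environment):

```python
# Minimal sketch: grad_theta log P(tau; theta) reduces to
# sum_t grad_theta log pi_theta(a_t | s_t); the dynamics never appear.
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Arbitrary small policy network for illustration.
policy = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2), nn.Softmax(dim=-1)
)

# Placeholder data standing in for one sampled trajectory of length H + 1 = 5.
states = torch.randn(5, 4)           # s_0, ..., s_H
actions = torch.randint(0, 2, (5,))  # a_0, ..., a_H

log_probs = Categorical(probs=policy(states)).log_prob(actions)
sum_log_pi = log_probs.sum()         # sum_t log pi_theta(a_t | s_t)

sum_log_pi.backward()                # fills .grad with sum_t grad log pi_theta(a_t|s_t)
```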

## That's it!

Plugging this result into the sample-based estimate yields the equation we use to estimate the gradient:

\nabla_\theta U(\theta) \approx \hat{g} = \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)})
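
In practice, with an automatic differentiation library, this estimate is usually obtained by building the scalar \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)}) and backpropagating through it. The sketch below is one such minimal implementation in PyTorch; the policy network, batch of trajectories, and returns are placeholders for whatever your agent collects, not code from the lesson itself:

```python
# Sketch of one gradient-ascent step using the estimate g_hat.
import torch
import torch.nn as nn
from torch.distributions import Categorical

policy = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2), nn.Softmax(dim=-1)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Placeholder batch of m = 3 trajectories: (states, actions, R(tau)).
trajectories = [
    (torch.randn(5, 4), torch.randint(0, 2, (5,)), torch.randn(()).item())
    for _ in range(3)
]

surrogate = torch.zeros(())
for states, actions, ret in trajectories:
    log_pi = Categorical(probs=policy(states)).log_prob(actions).sum()
    surrogate = surrogate + log_pi * ret      # sum_t log pi * R(tau^{(i)})
surrogate = surrogate / len(trajectories)     # average over the m trajectories

optimizer.zero_grad()
(-surrogate).backward()  # gradient of this loss is -g_hat
optimizer.step()         # so the optimizer step ascends U(theta)
```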